Classification - Instacart Product Reorder Prediction
Predicting whether a customer will reorder products in their next order using the
Instacart Market Basket Analysis dataset.
Dataset Source: Kaggle Instacart Market Basket Analysis
Problem Type: Classification
Target Variable: Binary prediction of whether a product will be reordered (1) or not (0)
Use Case: E-commerce recommendation systems, inventory management, personalized marketing campaigns
Package Imports
!pip install xplainable
!pip install xplainable-client
import pandas as pd
import xplainable as xp
from xplainable.core.models import XClassifier
from xplainable.core.optimisation.bayesian import XParamOptimiser
from xplainable.preprocessing.pipeline import XPipeline
from xplainable.preprocessing import transformers as xtf
from sklearn.model_selection import train_test_split
import requests
import json
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import gc
Xplainable Cloud Setup
import xplainable_client

client = xplainable_client.Client(
    api_key="",  # paste your Xplainable Cloud API key here
)
Data Loading and Exploration
Load the Instacart Market Basket Analysis datasets.
Note: Download the datasets from Kaggle and extract the CSV files, or follow the
download instructions below.
Dataset Download Instructions
Option 1: Direct Download (Recommended for beginners)
- Visit the Kaggle Instacart Market Basket Analysis competition page
- Create a free Kaggle account if you don't have one
- Download the dataset ZIP file (~200MB)
- Extract all CSV files to your working directory
Option 2: Kaggle API (Recommended for experienced users)
pip install kaggle
kaggle competitions download -c instacart-market-basket-analysis
unzip instacart-market-basket-analysis.zip
Note: The dataset contains 6 CSV files with over 3 million orders and 32 million
order products.
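With roughly 3.4 million orders and 32 million prior order-product rows, memory can get tight on smaller machines. As an optional alternative to the plain read_csv calls below, here is a sketch of memory-leaner loading with explicit dtypes (column names follow the Instacart data dictionary; apply the same idea to the remaining files):
orders = pd.read_csv('orders.csv', dtype={
    'order_id': 'int32', 'user_id': 'int32', 'eval_set': 'category',
    'order_number': 'int16', 'order_dow': 'int8',
    'order_hour_of_day': 'int8', 'days_since_prior_order': 'float32',
})
order_products_prior = pd.read_csv('order_products__prior.csv', dtype={
    'order_id': 'int32', 'product_id': 'int32',
    'add_to_cart_order': 'int16', 'reordered': 'int8',
})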
try:
    orders = pd.read_csv('orders.csv')
    order_products_train = pd.read_csv('order_products__train.csv')
    order_products_prior = pd.read_csv('order_products__prior.csv')
    products = pd.read_csv('products.csv')
    aisles = pd.read_csv('aisles.csv')
    departments = pd.read_csv('departments.csv')

    print("Datasets loaded successfully!")
    print(f"Orders: {orders.shape}")
    print(f"Order Products (Train): {order_products_train.shape}")
    print(f"Order Products (Prior): {order_products_prior.shape}")
    print(f"Products: {products.shape}")
    print(f"Aisles: {aisles.shape}")
    print(f"Departments: {departments.shape}")
except FileNotFoundError as e:
    print(f"Error loading files: {e}")
    print("Please ensure you have downloaded and extracted the Instacart dataset files.")
    print("Files should be in the same directory as this notebook.")
Inspecting orders dataset
orders.head()
| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
0 | 2539329 | 1 | prior | 1 | 2 | 8 | nan |
1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15 |
2 | 473747 | 1 | prior | 3 | 3 | 12 | 21 |
3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29 |
4 | 431534 | 1 | prior | 5 | 4 | 15 | 28 |
orders.info()
Out: <class 'pandas.core.frame.DataFrame'>
RangeIndex: 3421083 entries, 0 to 3421082
Data columns (total 7 columns):
# Column Dtype
--- ------ -----
0 order_id int64
1 user_id int64
2 eval_set object
3 order_number int64
4 order_dow int64
5 order_hour_of_day int64
6 days_since_prior_order float64
dtypes: float64(1), int64(5), object(1)
memory usage: 182.7+ MB
orders.isnull().sum()
Out: order_id 0
user_id 0
eval_set 0
order_number 0
order_dow 0
order_hour_of_day 0
days_since_prior_order 206209
dtype: int64
There are 206,209 missing values in the days_since_prior_order column: one per user, corresponding to each user's first order.
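These missing values are structural rather than data-quality problems: days_since_prior_order is undefined for a user's first order, and 206,209 matches the number of unique users. A quick sanity check:
# Every first order has a missing days_since_prior_order -- one per user
assert orders.loc[orders.order_number == 1, 'days_since_prior_order'].isna().all()
print(orders.user_id.nunique())  # 206209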
Inspecting order_products_train dataset
order_products_train.head()
| | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
0 | 1 | 49302 | 1 | 1 |
1 | 1 | 11109 | 2 | 1 |
2 | 1 | 10246 | 3 | 0 |
3 | 1 | 49683 | 4 | 0 |
4 | 1 | 43633 | 5 | 1 |
order_products_train.shape
order_products_train.isnull().sum()
Out: order_id 0
product_id 0
add_to_cart_order 0
reordered 0
dtype: int64
Inspecting order_products_prior dataset
order_products_prior.head()
| | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
0 | 2 | 33120 | 1 | 1 |
1 | 2 | 28985 | 2 | 1 |
2 | 2 | 9327 | 3 | 0 |
3 | 2 | 45918 | 4 | 1 |
4 | 2 | 30035 | 5 | 0 |
order_products_prior.shape
order_products_prior.isnull().sum()
Out: order_id 0
product_id 0
add_to_cart_order 0
reordered 0
dtype: int64
Inspecting products dataset
products.head()
| | product_id | product_name | aisle_id | department_id |
|---|---|---|---|---|
0 | 1 | Chocolate Sandwich Cookies | 61 | 19 |
1 | 2 | All-Seasons Salt | 104 | 13 |
2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |
3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 |
4 | 5 | Green Chile Anytime Sauce | 5 | 13 |
products.isnull().sum()
Out: product_id 0
product_name 0
aisle_id 0
department_id 0
dtype: int64
Inspecting aisles dataset
aisles.head()
| | aisle_id | aisle |
|---|---|---|
0 | 1 | prepared soups salads |
1 | 2 | specialty cheeses |
2 | 3 | energy granola bars |
3 | 4 | instant foods |
4 | 5 | marinades meat preparation |
aisles.isnull().sum()
Out: aisle_id 0
aisle 0
dtype: int64
Inspecting departments dataset
departments.head()
| | department_id | department |
|---|---|---|
0 | 1 | frozen |
1 | 2 | other |
2 | 3 | bakery |
3 | 4 | produce |
4 | 5 | alcohol |
departments.isnull().sum()
Out: department_id 0
department 0
dtype: int64
Exploratory Data Analysis (EDA)
plt.figure(figsize=(6,4))
color = sns.color_palette()  # palette reused across the count plots below
sns.countplot(x="order_dow", data=orders, color=color[0])
plt.ylabel('Count', fontsize=12)
plt.xlabel('Day of week', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Orders by week day", fontsize=15)
plt.show()

Order volume is highest on days 0 and 1 (generally taken to be the weekend), suggesting
people do most of their grocery shopping when they are at home.
plt.figure(figsize=(6,4))
sns.countplot(x="order_hour_of_day", data=orders, color=color[0])
plt.ylabel('Count', fontsize=12)
plt.xlabel('Hour of day', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Orders by Hour of day", fontsize=15)
plt.show()

Peak ordering hours are between 9 AM and 5 PM; few orders are placed before 7 AM or
after 11 PM.
plt.figure(figsize=(10,6))
sns.countplot(x="days_since_prior_order", data=orders)
plt.xticks(rotation=90)
plt.show()
The most common gap between orders is 30 days (the column is capped at 30), with 7 days
the second most common, suggesting strong monthly and weekly ordering habits.
products_details = pd.merge(left=products, right=departments, how="left", on="department_id")
products_details = pd.merge(left=products_details, right=aisles, how="left", on="aisle_id")
products_details.head()
| | product_id | product_name | aisle_id | department_id | department | aisle |
|---|---|---|---|---|---|---|
0 | 1 | Chocolate Sandwich Cookies | 61 | 19 | snacks | cookies cakes |
1 | 2 | All-Seasons Salt | 104 | 13 | pantry | spices seasonings |
2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 | beverages | tea |
3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 | frozen | frozen meals |
4 | 5 | Green Chile Anytime Sauce | 5 | 13 | pantry | marinades meat preparation |
plt.figure(figsize=(10,6))
g=sns.countplot(x="department",data=products_details)
g.set_xticklabels(g.get_xticklabels(), rotation=40, ha="right")
plt.show()

Personal care is the department with the most distinct products, followed by snacks.
plt.figure(figsize=(10,6))
products_details["aisle"].value_counts()[:10].plot(kind="bar", title="Aisles")
plt.show()

The placeholder aisle "missing" contains the most products, indicating many products have no aisle assignment.
order_products_name_train = pd.merge(left=order_products_train, right=products.loc[:,["product_id","product_name"]], on="product_id", how="left")
common_Products = order_products_name_train[order_products_name_train.reordered == 1]["product_name"].value_counts().to_frame().reset_index()
common_Products.columns = ["product_name", "count"]  # normalise column names across pandas versions
plt.figure(figsize=(12,7))
sns.barplot(x="count", y="product_name", data=common_Products.head(10))
plt.ylabel('product_name', fontsize=12)
plt.xlabel('count', fontsize=12)
plt.show()

Banana is the most frequently reordered product, followed by Bag of Organic Bananas.
order_products_name_train = pd.merge(left=order_products_name_train, right=products_details.loc[:,["product_id","aisle","department"]], on="product_id", how="left")
common_aisle = order_products_name_train["aisle"].value_counts().to_frame().reset_index()
common_aisle.columns = ["aisle", "count"]  # normalise column names across pandas versions
plt.figure(figsize=(12,7))
sns.barplot(x="count", y="aisle", data=common_aisle.head(10), palette="Blues_d")
plt.ylabel('aisle', fontsize=12)
plt.xlabel('count', fontsize=12)
plt.show()

The fresh vegetables aisle has the most purchases, followed by fresh fruits.
common_aisle = order_products_name_train["department"].value_counts().to_frame().reset_index()
common_aisle.columns = ["department", "count"]
plt.figure(figsize=(12,7))
sns.barplot(x="count", y="department", data=common_aisle, palette="Blues_d")
plt.ylabel('department', fontsize=12)
plt.xlabel('count', fontsize=12)
plt.show()

Produce and dairy eggs are the two departments with the most purchases.
train_data_reordered = order_products_train.groupby(["order_id","reordered"])["product_id"].apply(list).reset_index()
train_data_reordered = train_data_reordered[train_data_reordered.reordered == 1].drop(columns=["reordered"]).reset_index(drop=True)
train_data_reordered.head()
| | order_id | product_id |
|---|---|---|
0 | 1 | [49302, 11109, 43633, 22035] |
1 | 36 | [19660, 43086, 46620, 34497, 48679, 46979] |
2 | 38 | [21616] |
3 | 96 | [20574, 40706, 27966, 24489, 39275] |
4 | 98 | [8859, 19731, 43654, 13176, 4357, 37664, 34065... |
1. Data Preprocessing
Feature Engineering and Data Preparation
Create features for user behavior, product characteristics, and user-product
interactions.
del products_details
del order_products_name_train
del common_Products
del common_aisle
del train_data_reordered
gc.collect()
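# Keep a 15% random sample of users so the feature engineering fits in memory; adjust frac as needed.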
orders = orders.loc[orders.user_id.isin(orders.user_id.drop_duplicates().sample(frac=0.15, random_state=101))]
aisles['aisle'] = aisles['aisle'].astype('category')
departments['department'] = departments['department'].astype('category')
orders['eval_set'] = orders['eval_set'].astype('category')
products['product_name'] = products['product_name'].astype('category')
prior_orders = pd.merge(orders, order_products_prior, on='order_id', how='inner')
prior_orders.head()
| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|---|---|---|---|---|---|
0 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 196 | 1 | 0 |
1 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 14084 | 2 | 0 |
2 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 12427 | 3 | 0 |
3 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 26088 | 4 | 0 |
4 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 26405 | 5 | 0 |
Create features using user_id
users = prior_orders.groupby(by='user_id')['order_number'].aggregate('max').to_frame('num_of_orders_for_each_user').reset_index()
users.head()
| | user_id | num_of_orders_for_each_user |
|---|---|---|
0 | 1 | 10 |
1 | 2 | 14 |
2 | 3 | 12 |
3 | 4 | 5 |
4 | 5 | 4 |
total_product_per_order = prior_orders.groupby(by=['user_id', 'order_id'])['product_id'].aggregate('count').to_frame('total_products_per_order').reset_index()
avg_number_of_products_per_order = total_product_per_order.groupby(by=['user_id'])['total_products_per_order'].mean().to_frame('avg_no_prd_per_order').reset_index()
del total_product_per_order
gc.collect()
avg_number_of_products_per_order.head()
| | user_id | avg_no_prd_per_order |
|---|---|---|
0 | 1 | 5.9 |
1 | 2 | 13.9286 |
2 | 3 | 7.33333 |
3 | 4 | 3.6 |
4 | 5 | 9.25 |
from scipy import stats

def calculate_mode(x):
    # Return the modal value, handling scipy versions that return arrays vs scalars
    if len(x) > 0:
        mode_result = stats.mode(x)
        if isinstance(mode_result.mode, np.ndarray) and mode_result.mode.size > 0:
            return mode_result.mode[0]
        return mode_result.mode
    return pd.NA
order_most_dow = prior_orders.groupby(by=['user_id'])['order_dow'].aggregate(calculate_mode).to_frame('dow_with_most_orders').reset_index()
order_most_dow.head()
| | user_id | dow_with_most_orders |
|---|---|---|
0 | 1 | 4 |
1 | 2 | 2 |
2 | 3 | 0 |
3 | 4 | 4 |
4 | 5 | 3 |
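For reference, the same statistic can be computed without scipy using pandas' own mode (a sketch; Series.mode returns ties sorted ascending, so tie-breaking may differ slightly from scipy):
order_most_dow_alt = (
    prior_orders.groupby('user_id')['order_dow']
    .agg(lambda s: s.mode().iloc[0])
    .to_frame('dow_with_most_orders')
    .reset_index()
)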
# calculate_mode works on any series, so we reuse it for the hour of day
order_most_hod = prior_orders.groupby(by=['user_id'])['order_hour_of_day'].aggregate(calculate_mode).to_frame('hod_with_most_orders').reset_index()
order_most_hod.head()
| | user_id | hod_with_most_orders |
|---|---|---|
0 | 1 | 7 |
1 | 2 | 9 |
2 | 3 | 16 |
3 | 4 | 15 |
4 | 5 | 18 |
user_reorder_ratio = prior_orders.groupby(by='user_id')['reordered'].aggregate('mean').to_frame('reorder_ratio').reset_index()
user_reorder_ratio['reorder_ratio'] = user_reorder_ratio['reorder_ratio'].astype(np.float16)
user_reorder_ratio.head()
| | user_id | reorder_ratio |
|---|---|---|
0 | 1 | 0.694824 |
1 | 2 | 0.476807 |
2 | 3 | 0.625 |
3 | 4 | 0.055542 |
4 | 5 | 0.378418 |
users = users.merge(avg_number_of_products_per_order, on='user_id', how='left')
users = users.merge(order_most_dow, on='user_id', how='left')
users = users.merge(order_most_hod, on='user_id', how='left')
users = users.merge(user_reorder_ratio, on='user_id', how='left')
users.head()
| | user_id | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio |
|---|---|---|---|---|---|---|---|
0 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 |
1 | 2 | 14 | 13.9286 | 13.9286 | 2 | 9 | 0.476807 |
2 | 3 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 |
3 | 4 | 5 | 3.6 | 3.6 | 4 | 15 | 0.055542 |
4 | 5 | 4 | 9.25 | 9.25 | 3 | 18 | 0.378418 |
del [avg_number_of_products_per_order,order_most_dow,order_most_hod,user_reorder_ratio]
gc.collect()
Create features using product_id.
purchased_num_of_times = prior_orders.groupby(by='product_id')['order_id'].aggregate('count').to_frame('purchased_num_of_times').reset_index()
purchased_num_of_times.head()
| | product_id | purchased_num_of_times |
|---|---|---|
0 | 1 | 1852 |
1 | 2 | 90 |
2 | 3 | 277 |
3 | 4 | 329 |
4 | 5 | 15 |
product_reorder_ratio = prior_orders.groupby(by='product_id')['reordered'].aggregate('mean').to_frame('product_reorder_ratio').reset_index()
product_reorder_ratio.head()
| | product_id | product_reorder_ratio |
|---|---|---|
0 | 1 | 0.613391 |
1 | 2 | 0.133333 |
2 | 3 | 0.732852 |
3 | 4 | 0.446809 |
4 | 5 | 0.6 |
add_to_cart = prior_orders.groupby(by='product_id')['add_to_cart_order'].aggregate('mean').to_frame('product_avg_cart_addition').reset_index()
add_to_cart.head()
| | product_id | product_avg_cart_addition |
|---|---|---|
0 | 1 | 5.80184 |
1 | 2 | 9.88889 |
2 | 3 | 6.41516 |
3 | 4 | 9.5076 |
4 | 5 | 6.46667 |
purchased_num_of_times = purchased_num_of_times.merge(product_reorder_ratio, on='product_id', how='left')
purchased_num_of_times = purchased_num_of_times.merge(add_to_cart, on='product_id', how='left')
del [product_reorder_ratio, add_to_cart]
gc.collect()
purchased_num_of_times.head()
| | product_id | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition |
|---|---|---|---|---|
0 | 1 | 1852 | 0.613391 | 5.80184 |
1 | 2 | 90 | 0.133333 | 9.88889 |
2 | 3 | 277 | 0.732852 | 6.41516 |
3 | 4 | 329 | 0.446809 | 9.5076 |
4 | 5 | 15 | 0.6 | 6.46667 |
Creating features using user_id and product_id
user_product_data = prior_orders.groupby(by=['user_id', 'product_id'])['order_id'].aggregate('count').to_frame('uxp_times_bought').reset_index()
user_product_data.head()
| | user_id | product_id | uxp_times_bought |
|---|---|---|---|
0 | 1 | 196 | 10 |
1 | 1 | 10258 | 9 |
2 | 1 | 10326 | 1 |
3 | 1 | 12427 | 10 |
4 | 1 | 13032 | 3 |
product_first_order_num = prior_orders.groupby(by=['user_id', 'product_id'])['order_number'].aggregate('min').to_frame('first_order_number').reset_index()
product_first_order_num.head()
| | user_id | product_id | first_order_number |
|---|---|---|---|
0 | 1 | 196 | 1 |
1 | 1 | 10258 | 2 |
2 | 1 | 10326 | 5 |
3 | 1 | 12427 | 1 |
4 | 1 | 13032 | 2 |
total_orders = prior_orders.groupby('user_id')['order_number'].max().to_frame('total_orders').reset_index()
total_orders.head()
| | user_id | total_orders |
|---|---|---|
0 | 1 | 10 |
1 | 2 | 14 |
2 | 3 | 12 |
3 | 4 | 5 |
4 | 5 | 4 |
user_product_df = pd.merge(total_orders, product_first_order_num, on='user_id', how='right')
user_product_df.head()
| | user_id | total_orders | product_id | first_order_number |
|---|---|---|---|---|
0 | 1 | 10 | 196 | 1 |
1 | 1 | 10 | 10258 | 2 |
2 | 1 | 10 | 10326 | 5 |
3 | 1 | 10 | 12427 | 1 |
4 | 1 | 10 | 13032 | 2 |
user_product_df['order_range'] = user_product_df['total_orders'] - user_product_df['first_order_number'] + 1
user_product_df.head()
| | user_id | total_orders | product_id | first_order_number | order_range |
|---|---|---|---|---|---|
0 | 1 | 10 | 196 | 1 | 10 |
1 | 1 | 10 | 10258 | 2 | 9 |
2 | 1 | 10 | 10326 | 5 | 6 |
3 | 1 | 10 | 12427 | 1 | 10 |
4 | 1 | 10 | 13032 | 2 | 9 |
number_of_times = prior_orders.groupby(by=['user_id', 'product_id'])['order_id'].aggregate('count').to_frame('times_bought').reset_index()
number_of_times.head()
| | user_id | product_id | times_bought |
|---|---|---|---|
0 | 1 | 196 | 10 |
1 | 1 | 10258 | 9 |
2 | 1 | 10326 | 1 |
3 | 1 | 12427 | 10 |
4 | 1 | 13032 | 3 |
uxp_ratio = pd.merge(number_of_times, user_product_df, on=['user_id', 'product_id'], how='left')
uxp_ratio.head()
| | user_id | product_id | times_bought | total_orders | first_order_number | order_range |
|---|---|---|---|---|---|---|
0 | 1 | 196 | 10 | 10 | 1 | 10 |
1 | 1 | 10258 | 9 | 10 | 2 | 9 |
2 | 1 | 10326 | 1 | 10 | 5 | 6 |
3 | 1 | 12427 | 10 | 10 | 1 | 10 |
4 | 1 | 13032 | 3 | 10 | 2 | 9 |
uxp_ratio['uxp_reorder_ratio'] = uxp_ratio['times_bought'] / uxp_ratio['order_range']
uxp_ratio.head()
| | user_id | product_id | times_bought | total_orders | first_order_number | order_range | uxp_reorder_ratio |
|---|---|---|---|---|---|---|---|
0 | 1 | 196 | 10 | 10 | 1 | 10 | 1 |
1 | 1 | 10258 | 9 | 10 | 2 | 9 | 1 |
2 | 1 | 10326 | 1 | 10 | 5 | 6 | 0.166667 |
3 | 1 | 12427 | 10 | 10 | 1 | 10 | 1 |
4 | 1 | 13032 | 3 | 10 | 2 | 9 | 0.333333 |
uxp_ratio.drop(['times_bought', 'total_orders', 'first_order_number', 'order_range'], axis=1, inplace=True)
uxp_ratio.head()
| | user_id | product_id | uxp_reorder_ratio |
|---|---|---|---|
0 | 1 | 196 | 1 |
1 | 1 | 10258 | 1 |
2 | 1 | 10326 | 0.166667 |
3 | 1 | 12427 | 1 |
4 | 1 | 13032 | 0.333333 |
user_product_data = user_product_data.merge(uxp_ratio, on=['user_id', 'product_id'], how='left')
del [product_first_order_num, number_of_times,user_product_df,total_orders, uxp_ratio]
gc.collect()
user_product_data.head()
| | user_id | product_id | uxp_times_bought | uxp_reorder_ratio |
|---|---|---|---|---|
0 | 1 | 196 | 10 | 1 |
1 | 1 | 10258 | 9 | 1 |
2 | 1 | 10326 | 1 | 0.166667 |
3 | 1 | 12427 | 10 | 1 |
4 | 1 | 13032 | 3 | 0.333333 |
prior_orders['order_number_back'] = prior_orders.groupby(by=['user_id'])['order_number'].transform('max') - prior_orders.order_number + 1
prior_orders.head()
| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_order | reordered | order_number_back |
|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 196 | 1 | 0 | 10 |
1 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 14084 | 2 | 0 | 10 |
2 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 12427 | 3 | 0 | 10 |
3 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 26088 | 4 | 0 | 10 |
4 | 2539329 | 1 | prior | 1 | 2 | 8 | nan | 26405 | 5 | 0 | 10 |
temp_df = prior_orders.loc[prior_orders.order_number_back <= 3]
temp_df.head()
| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_order | reordered | order_number_back |
|---|---|---|---|---|---|---|---|---|---|---|---|
38 | 3108588 | 1 | prior | 8 | 1 | 14 | 14 | 12427 | 1 | 1 | 3 |
39 | 3108588 | 1 | prior | 8 | 1 | 14 | 14 | 196 | 2 | 1 | 3 |
40 | 3108588 | 1 | prior | 8 | 1 | 14 | 14 | 10258 | 3 | 1 | 3 |
41 | 3108588 | 1 | prior | 8 | 1 | 14 | 14 | 25133 | 4 | 1 | 3 |
42 | 3108588 | 1 | prior | 8 | 1 | 14 | 14 | 46149 | 5 | 0 | 3 |
last_three_order = temp_df.groupby(by=['user_id', 'product_id'])['order_id'].aggregate('count').to_frame('uxp_last_three').reset_index()
last_three_order.head()
| | user_id | product_id | uxp_last_three |
|---|---|---|---|
0 | 1 | 196 | 3 |
1 | 1 | 10258 | 3 |
2 | 1 | 12427 | 3 |
3 | 1 | 13032 | 1 |
4 | 1 | 25133 | 3 |
last_three_order['uxp_ratio_last_three'] = last_three_order['uxp_last_three'] / 3
last_three_order.head()
| | user_id | product_id | uxp_last_three | uxp_ratio_last_three |
|---|---|---|---|---|
0 | 1 | 196 | 3 | 1 |
1 | 1 | 10258 | 3 | 1 |
2 | 1 | 12427 | 3 | 1 |
3 | 1 | 13032 | 1 | 0.333333 |
4 | 1 | 25133 | 3 | 1 |
user_product_data = user_product_data.merge(last_three_order, on=['user_id', 'product_id'], how='left')
del [last_three_order, temp_df]
gc.collect()
user_product_data.head()
| | user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three |
|---|---|---|---|---|---|---|
0 | 1 | 196 | 10 | 1 | 3 | 1 |
1 | 1 | 10258 | 9 | 1 | 3 | 1 |
2 | 1 | 10326 | 1 | 0.166667 | nan | nan |
3 | 1 | 12427 | 10 | 1 | 3 | 1 |
4 | 1 | 13032 | 3 | 0.333333 | 1 | 0.333333 |
user_product_data.isnull().sum()
Out: user_id 0
product_id 0
uxp_times_bought 0
uxp_reorder_ratio 0
uxp_last_three 8382738
uxp_ratio_last_three 8382738
dtype: int64
user_product_data.fillna(0, inplace=True)
user_product_data.isnull().sum()
Out: user_id 0
product_id 0
uxp_times_bought 0
uxp_reorder_ratio 0
uxp_last_three 0
uxp_ratio_last_three 0
dtype: int64
Create final dataframe for engineered features
featured_engineered_data = user_product_data.merge(users, on='user_id', how='left')
featured_engineered_data = featured_engineered_data.merge(purchased_num_of_times, on='product_id', how='left')
del [users, user_product_data, purchased_num_of_times]
gc.collect()
featured_engineered_data.head()
| | user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 196 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 35791 | 0.77648 | 3.72177 |
1 | 1 | 10258 | 9 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 1946 | 0.713772 | 4.27749 |
2 | 1 | 10326 | 1 | 0.166667 | 0 | 0 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 5526 | 0.652009 | 4.1911 |
3 | 1 | 12427 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 6476 | 0.740735 | 4.76004 |
4 | 1 | 13032 | 3 | 0.333333 | 1 | 0.333333 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 3751 | 0.657158 | 5.62277 |
featured_engineered_data.isnull().sum()
Out: user_id 0
product_id 0
uxp_times_bought 0
uxp_reorder_ratio 0
uxp_last_three 0
uxp_ratio_last_three 0
num_of_orders_for_each_user 0
avg_no_prd_per_order_x 0
avg_no_prd_per_order_y 0
dow_with_most_orders 0
hod_with_most_orders 0
reorder_ratio 0
purchased_num_of_times 0
product_reorder_ratio 0
product_avg_cart_addition 0
dtype: int64
Creating Train and Test datasets
Create training dataset
orders_future = orders[((orders.eval_set=='train') | (orders.eval_set=='test'))]
orders_future = orders_future[['user_id', 'eval_set', 'order_id']]
final_data = featured_engineered_data.merge(orders_future, on='user_id', how='left')
final_data.head()
| | user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition | eval_set | order_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 196 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 35791 | 0.77648 | 3.72177 | train | 1187899 |
1 | 1 | 10258 | 9 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 1946 | 0.713772 | 4.27749 | train | 1187899 |
2 | 1 | 10326 | 1 | 0.166667 | 0 | 0 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 5526 | 0.652009 | 4.1911 | train | 1187899 |
3 | 1 | 12427 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 6476 | 0.740735 | 4.76004 | train | 1187899 |
4 | 1 | 13032 | 3 | 0.333333 | 1 | 0.333333 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 3751 | 0.657158 | 5.62277 | train | 1187899 |
train_data = final_data[final_data.eval_set=='train']
train_data.head()
| | user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition | eval_set | order_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 196 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 35791 | 0.77648 | 3.72177 | train | 1187899 |
1 | 1 | 10258 | 9 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 1946 | 0.713772 | 4.27749 | train | 1187899 |
2 | 1 | 10326 | 1 | 0.166667 | 0 | 0 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 5526 | 0.652009 | 4.1911 | train | 1187899 |
3 | 1 | 12427 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 6476 | 0.740735 | 4.76004 | train | 1187899 |
4 | 1 | 13032 | 3 | 0.333333 | 1 | 0.333333 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 3751 | 0.657158 | 5.62277 | train | 1187899 |
train_data = train_data.merge(order_products_train[['product_id', 'order_id', 'reordered']], on=['product_id', 'order_id'], how='left')
train_data.head()
| | user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition | eval_set | order_id | reordered |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 196 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 35791 | 0.77648 | 3.72177 | train | 1187899 | 1 |
1 | 1 | 10258 | 9 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 1946 | 0.713772 | 4.27749 | train | 1187899 | 1 |
2 | 1 | 10326 | 1 | 0.166667 | 0 | 0 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 5526 | 0.652009 | 4.1911 | train | 1187899 | nan |
3 | 1 | 12427 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 6476 | 0.740735 | 4.76004 | train | 1187899 | nan |
4 | 1 | 13032 | 3 | 0.333333 | 1 | 0.333333 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 3751 | 0.657158 | 5.62277 | train | 1187899 | 1 |
train_data.isnull().sum()
Out: user_id 0
product_id 0
uxp_times_bought 0
uxp_reorder_ratio 0
uxp_last_three 0
uxp_ratio_last_three 0
num_of_orders_for_each_user 0
avg_no_prd_per_order_x 0
avg_no_prd_per_order_y 0
dow_with_most_orders 0
hod_with_most_orders 0
reorder_ratio 0
purchased_num_of_times 0
product_reorder_ratio 0
product_avg_cart_addition 0
eval_set 0
order_id 0
reordered 7645837
dtype: int64
train_data['reordered'] = train_data['reordered'].fillna(0)
train_data.head()
| | user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition | eval_set | order_id | reordered |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 196 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 35791 | 0.77648 | 3.72177 | train | 1187899 | 1 |
1 | 1 | 10258 | 9 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 1946 | 0.713772 | 4.27749 | train | 1187899 | 1 |
2 | 1 | 10326 | 1 | 0.166667 | 0 | 0 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 5526 | 0.652009 | 4.1911 | train | 1187899 | 0 |
3 | 1 | 12427 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 6476 | 0.740735 | 4.76004 | train | 1187899 | 0 |
4 | 1 | 13032 | 3 | 0.333333 | 1 | 0.333333 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 3751 | 0.657158 | 5.62277 | train | 1187899 | 1 |
train_data = train_data.set_index(['user_id', 'product_id'])
train_data = train_data.drop(['eval_set', 'order_id'], axis=1)
train_data.head()
| user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition | reordered |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 196 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 35791 | 0.77648 | 3.72177 | 1 |
1 | 10258 | 9 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 1946 | 0.713772 | 4.27749 | 1 |
1 | 10326 | 1 | 0.166667 | 0 | 0 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 5526 | 0.652009 | 4.1911 | 0 |
1 | 12427 | 10 | 1 | 3 | 1 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 6476 | 0.740735 | 4.76004 | 0 |
1 | 13032 | 3 | 0.333333 | 1 | 0.333333 | 10 | 5.9 | 5.9 | 4 | 7 | 0.694824 | 3751 | 0.657158 | 5.62277 | 1 |
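Before modelling, it is worth checking the class balance; the positive class is rare here (roughly 10% of user-product pairs, though the exact figure depends on the 15% user sample), which is why a low probability cutoff is used later in Model Testing instead of the default 0.5:
train_data['reordered'].value_counts(normalize=True)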
Create testing dataset
test_data = final_data[final_data.eval_set=='test']
test_data.head()
| | user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition | eval_set | order_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
120 | 3 | 248 | 1 | 0.090909 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 6371 | 0.400251 | 10.6208 | test | 2774568 |
121 | 3 | 1005 | 1 | 0.333333 | 1 | 0.333333 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 463 | 0.440605 | 9.49892 | test | 2774568 |
122 | 3 | 1819 | 3 | 0.333333 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 2424 | 0.492162 | 9.28754 | test | 2774568 |
123 | 3 | 7503 | 1 | 0.1 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 12474 | 0.553551 | 9.54738 | test | 2774568 |
124 | 3 | 8021 | 1 | 0.090909 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 27864 | 0.591157 | 8.82285 | test | 2774568 |
test_data = test_data.set_index(['user_id', 'product_id'])
test_data = test_data.drop(['eval_set', 'order_id'], axis=1)
test_data.head()
| user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 248 | 1 | 0.090909 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 6371 | 0.400251 | 10.6208 |
3 | 1005 | 1 | 0.333333 | 1 | 0.333333 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 463 | 0.440605 | 9.49892 |
3 | 1819 | 3 | 0.333333 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 2424 | 0.492162 | 9.28754 |
3 | 7503 | 1 | 0.1 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 12474 | 0.553551 | 9.54738 |
3 | 8021 | 1 | 0.090909 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 27864 | 0.591157 | 8.82285 |
del [final_data, orders_future, products, order_products_train]
gc.collect()
Building model using Xplainable Classifier
Build X_train and y_train dataset
X_train, y_train = train_data.drop('reordered', axis=1), train_data.reordered
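The notebook fits and evaluates on the full training frame. train_test_split is already imported; if you would rather tune the probability threshold on held-out data, a minimal sketch:
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=101, stratify=y_train
)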
2. Model Optimization
Optimize hyperparameters using a subset of the data for computational efficiency.
from xplainable.core.optimisation.bayesian import XParamOptimiser
opt = XParamOptimiser()
params = opt.optimise(X_train[:1000000], y_train[:1000000])
Out: 100%|████████| 30/30 [00:45<00:00, 1.53s/trial, best loss: -0.8107380819617527]
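The optimiser above runs with its default settings. If you want to steer the search, the constructor arguments below follow xplainable's documented quickstart, but treat the parameter names as assumptions and verify them against your installed version:
# Parameter names assumed from the xplainable docs -- verify before relying on them
opt = XParamOptimiser(
    metric='roc-auc',    # objective to optimise
    n_trials=30,         # number of Bayesian optimisation trials
    n_folds=5,           # cross-validation folds per trial
    early_stopping=30,   # stop early if no improvement
)
params = opt.optimise(X_train[:1000000], y_train[:1000000])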
3. Model Training
Train the XClassifier with optimized parameters on the full training dataset.
from xplainable.core.models import XClassifier
model = XClassifier(**params)
model.fit(X_train, y_train)
Out: <xplainable.core.ml.classification.XClassifier at 0x2a426f760>
4. Model Interpretability and Explainability
Analyze feature importance and model decision-making for the product reorder
predictions.
In the Feature Importances view, uxp_reorder_ratio carries the highest importance,
indicating it is the most influential factor in the model's predictions.
In the Contributions view, green bars represent positive contributions to the predicted
probability and red bars negative ones; uxp_reorder_ratio again stands out, showing a
strong positive influence on the likelihood of a reorder.
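The views described above can be rendered directly from the fitted model; xplainable ships a built-in explainer for this (check the call against your installed version):
model.explain()  # renders the feature importance and contribution charts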
5. Model Persisting (Optional)
Save the trained model to the Xplainable platform for future use and deployment.
print("Model persistence step skipped - uncomment above code to save model")
6. Model Deployment (Optional)
Deploy the model to the Xplainable platform for real-time predictions.
print("Model deployment step skipped - uncomment above code to deploy model")
7. Model Testing
Create model predictions using a probability threshold cutoff.
NOTE: Adjust the threshold cutoff to see its impact on the results.
test_prediction = (model.predict_proba(test_data) >= 0.21).astype(int)
test_prediction[:5]
Out: array([0, 0, 0, 0, 0])
train_prediction = (model.predict_proba(X_train) >= 0.21).astype(int)
train_prediction[:5]
Out: array([1, 1, 0, 1, 0])
from sklearn.metrics import f1_score, classification_report
print(f'f1 Score: {f1_score(y_train, train_prediction)}')
print(classification_report(y_train, train_prediction))
Out: f1 Score: 0.41808008442097677
              precision    recall  f1-score   support

           0       0.94      0.92      0.93   7645837
           1       0.39      0.45      0.42    828824

    accuracy                           0.88   8474661
   macro avg       0.66      0.69      0.67   8474661
weighted avg       0.89      0.88      0.88   8474661
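Rather than guessing the 0.21 cutoff, you can sweep thresholds and keep the one that maximises F1 (a sketch scored on the training set; with a validation split, score on held-out data instead):
probs = model.predict_proba(X_train)
thresholds = np.arange(0.05, 0.55, 0.05)
scores = {round(t, 2): f1_score(y_train, (probs >= t).astype(int)) for t in thresholds}
best = max(scores, key=scores.get)
print(f"best threshold: {best} (F1 = {scores[best]:.4f})")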
test_data['prediction'] = test_prediction
test_data.head()
| user_id | product_id | uxp_times_bought | uxp_reorder_ratio | uxp_last_three | uxp_ratio_last_three | num_of_orders_for_each_user | avg_no_prd_per_order_x | avg_no_prd_per_order_y | dow_with_most_orders | hod_with_most_orders | reorder_ratio | purchased_num_of_times | product_reorder_ratio | product_avg_cart_addition | prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 248 | 1 | 0.090909 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 6371 | 0.400251 | 10.6208 | 0 |
3 | 1005 | 1 | 0.333333 | 1 | 0.333333 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 463 | 0.440605 | 9.49892 | 0 |
3 | 1819 | 3 | 0.333333 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 2424 | 0.492162 | 9.28754 | 0 |
3 | 7503 | 1 | 0.1 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 12474 | 0.553551 | 9.54738 | 0 |
3 | 8021 | 1 | 0.090909 | 0 | 0 | 12 | 7.33333 | 7.33333 | 0 | 16 | 0.625 | 27864 | 0.591157 | 8.82285 | 0 |
final_df = test_data.reset_index()
final_df = final_df[['product_id', 'user_id', 'prediction']]
gc.collect()
final_df.head()
| | product_id | user_id | prediction |
|---|---|---|---|
0 | 248 | 3 | 0 |
1 | 1005 | 3 | 0 |
2 | 1819 | 3 | 0 |
3 | 7503 | 3 | 0 |
4 | 8021 | 3 | 0 |
Creating the Kaggle submission file (optional)
After developing a robust model and ensuring its performance on our validation set, the
next step is to prepare our submission for Kaggle. Although this step is optional, it is
a good practice to understand how to create a submission file that adheres to the
competition's requirements.
To create a submission file, you typically need to:
- Ensure that your model has been trained on the full training set or with an appropriate cross-validation strategy.
- Generate predictions for the test set provided by Kaggle.
- Format these predictions into a CSV file that matches the competition's submission format; for this competition that means one row per order_id with a space-separated list of predicted product_ids (or 'None').
- Use pandas' to_csv() with the appropriate parameters, such as index=False if the index should not be included, to save your dataframe to a CSV file.
- Upload this CSV file to the Kaggle competition's submission page to see how your model performs on the unseen test set.
See the specific steps for the Kaggle upload below.
orders_test = orders.loc[orders.eval_set == 'test', ['user_id', 'order_id']]
orders_test.head()
| | user_id | order_id |
|---|---|---|
38 | 3 | 2.77457e+06 |
44 | 4 | 329954 |
53 | 6 | 1.52801e+06 |
96 | 11 | 1.37694e+06 |
102 | 12 | 1.35684e+06 |
final_df = final_df.merge(orders_test, on='user_id', how='left')
final_df.head()
| | product_id | user_id | prediction | order_id |
|---|---|---|---|---|
0 | 248 | 3 | 0 | 2.77457e+06 |
1 | 1005 | 3 | 0 | 2.77457e+06 |
2 | 1819 | 3 | 0 | 2.77457e+06 |
3 | 7503 | 3 | 0 | 2.77457e+06 |
4 | 8021 | 3 | 0 | 2.77457e+06 |
final_df = final_df.drop('user_id', axis=1)
final_df['product_id'] = final_df.product_id.astype(int)
final_df.head()
| | product_id | prediction | order_id |
|---|---|---|---|
0 | 248 | 0 | 2.77457e+06 |
1 | 1005 | 0 | 2.77457e+06 |
2 | 1819 | 0 | 2.77457e+06 |
3 | 7503 | 0 | 2.77457e+06 |
4 | 8021 | 0 | 2.77457e+06 |
final_dict = dict()
for row in final_df.itertuples():
    if row.prediction == 1:
        try:
            final_dict[row.order_id] += ' ' + str(row.product_id)
        except KeyError:
            final_dict[row.order_id] = str(row.product_id)

# Orders with no predicted reorders must still appear in the submission as 'None'
for order in final_df.order_id:
    if order not in final_dict:
        final_dict[order] = 'None'
gc.collect()
submission_df = pd.DataFrame.from_dict(final_dict, orient='index')
submission_df.reset_index(inplace=True)
submission_df.columns = ['order_id', 'products']
submission_df.head()
| | order_id | products |
|---|---|---|
0 | 2774568 | 17668 18599 21903 39190 43961 47766 |
1 | 1528013 | 21903 38293 |
2 | 1376945 | 8309 13176 14947 20383 27959 33572 35948 44632 |
3 | 1356845 | 5746 7076 8239 10863 11520 13176 14992 |
4 | 2161313 | 196 10441 11266 12427 14715 27839 37710 |
submission_df.to_csv('sub.csv', index=False, header=True)
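If you have the Kaggle CLI configured (see the download instructions earlier), the file can be submitted without leaving the notebook:
!kaggle competitions submit -c instacart-market-basket-analysis -f sub.csv -m "xplainable XClassifier baseline"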